
Classification using speed dating dataset

The “Business Decision”

We would like to understand who is most likely to be selected in speed dating, as well as the key drivers that affect people’s decisions when selecting a partner.

The Data

We used the data from a speed dating experiment conducted at Columbia Business School, available on Kaggle. This is how the first 5 out of the total of 1622 rows look:

01 02 03 04 05
attr_o 6 6 10 6
sinc_o 8 7 10 8 9
intel_o 8 10 10 6 9
fun_o 8 7 10 8 7
amb_o 7 6 10 10 9
shar_o 6 5 10 10 5
field_cd 1 1 1 1 1
race 2 2 2 2 2
goal 2 2 2 2 2
date 5 5 5 5 5
go_out 1 1 1 1 1
career_c 1 1 1 1 1
sports 1 1 1 1 1
tvsports 4 4 4 4 4
exercise 6 6 6 6 6
dining 7 7 7 7 7
museums 6 6 6 6 6
art 8 8 8 8 8
hiking 6 6 6 6 6
gaming 5 5 5 5 5
clubbing 7 7 7 7 7
reading 7 7 7 7 7
tv 7 7 7 7 7
theater 6 6 6 6 6
movies 6 6 6 6 6
concerts 3 3 3 3 3
music 7 7 7 7 7
shopping 1 1 1 1 1
yoga 3 3 3 3 3

A Process for Classification

Classification in 6 steps

We followed this approach to proceed with the classification (as explained in class):

  1. Create an estimation sample and two validation samples by splitting the data into three groups. Steps 2-5 below will then be performed only on the estimation and the first validation data. You should only do step 6 once on the second validation data, also called test data, and report/use the performance on that (second validation) data only to make final business decisions.
  2. Set up the dependent variable (as a categorical 0-1 variable; multi-class classification is also feasible, and similar, but we do not explore it in this note).
  3. Make a preliminary assessment of the relative importance of the explanatory variables using visualization tools and simple descriptive statistics.
  4. Estimate the classification model using the estimation data, and interpret the results.
  5. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-5 a few times in different ways to increase performance.
  6. Finally, assess the accuracy of classification in the second validation sample. You should eventually use/report all relevant performance measures/plots on this second validation sample only.

Let’s follow these steps.

Step 1: Split the data

We have three data samples: estimation_data (80% of the data in our case), validation_data (10% of the data) and test_data (the remaining 10% of the data).

In our case we use 1297 observations in the estimation data, 162 in the validation data, and 163 in the test data.
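The split itself is straightforward; below is a minimal sketch in Python (the original analysis was done in R), using scikit-learn on synthetic stand-in data of the same size:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the speed-dating features and the dec_o label;
# only the row count (1622) mirrors the real dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(1622, 5))
y = rng.integers(0, 2, size=1622)

# First carve out the 20% that will become validation + test data ...
X_est, X_rest, y_est, y_rest = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
# ... then split that 20% in half: 10% validation, 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest)

print(len(X_est), len(X_val), len(X_test))  # 1297 162 163
```

Stratifying on the dependent variable keeps the 0/1 proportions comparable across the three samples.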

Step 2: Choose dependent variable

Our dependent variable is dec_o. It states whether a given subject was selected by their partner. In our data the number of 0/1’s in our estimation sample is as follows.

Class 1 Class 0
# of Observations 572 725

while in the validation sample they are:

Class 1 Class 0
# of Observations 60 102
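Tallying the 0/1 classes is a one-liner; a Python sketch on a toy label vector (the real dec_o column has 572/725 in the estimation sample):

```python
import numpy as np

# Hypothetical toy stand-in for the 0/1 dependent variable dec_o.
dec_o = np.array([1, 0, 0, 1, 1, 0, 0, 0])

# bincount returns [count of 0s, count of 1s].
counts = np.bincount(dec_o, minlength=2)
print({"Class 1": int(counts[1]), "Class 0": int(counts[0])})
```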

Step 3: Simple Analysis

Below are the statistics of our independent variables across the two classes, starting with class 1, “selected”:

min 25 percent median mean 75 percent max std
attr_o 1 6 7 7.35 8 10 1.45
sinc_o 3 7 8 7.65 9 10 1.44
intel_o 3 7 8 7.74 9 10 1.26
fun_o 2 6 7 7.29 8 10 1.47
amb_o 2 6 7 7.05 8 10 1.60
shar_o 0 5 7 6.51 8 10 1.71
field_cd 1 4 8 7.10 9 17 3.62
race 1 2 2 2.60 3 6 1.21
goal 1 1 2 2.34 3 6 1.54
date 1 4 5 4.84 6 7 1.35
go_out 1 1 2 1.91 2 6 0.99
career_c 1 2 4 4.97 7 17 3.36
sports 1 5 7 6.59 9 10 2.35
tvsports 1 2 4 4.50 7 10 2.63
exercise 1 5 6 6.27 8 10 2.21
dining 3 7 8 7.75 9 10 1.69
museums 3 5 7 6.70 8 10 2.13
art 1 4 6 6.42 9 10 2.57
hiking 1 3 6 5.63 8 10 2.50
gaming 1 1 4 3.93 5 14 2.15
clubbing 1 4 7 6.12 8 10 2.40
reading 1 6 8 7.44 9 10 1.99
tv 1 3 5 4.92 7 10 2.23
theater 1 5 7 6.56 9 10 2.43
movies 2 7 8 7.85 9 10 1.82
concerts 2 5 7 6.77 8 10 2.00
music 4 7 8 7.78 9 10 1.73
shopping 1 4 6 5.54 7 10 2.70
yoga 1 2 4 4.30 7 10 2.70

and class 0, “not selected”:

min 25 percent median mean 75 percent max std
attr_o 1 4 6 5.38 7 10 1.80
sinc_o 0 6 7 6.84 8 10 1.84
intel_o 1 6 7 7.04 8 10 1.56
fun_o 0 5 6 5.67 7 10 1.94
amb_o 1 5 7 6.50 8 10 1.81
shar_o 0 3 5 4.72 6 10 2.06
field_cd 1 5 8 7.27 10 17 3.40
race 1 2 2 2.69 4 6 1.31
goal 1 1 2 2.58 4 6 1.69
date 1 4 5 5.04 6 7 1.26
go_out 1 1 2 2.18 3 6 1.22
career_c 1 2 6 5.11 7 17 3.31
sports 1 5 7 6.48 9 10 2.57
tvsports 1 2 4 4.54 7 10 2.99
exercise 1 5 6 5.93 8 10 2.26
dining 3 6 8 7.68 9 10 1.76
museums 2 5 7 6.52 8 10 2.23
art 1 4 6 6.34 9 10 2.45
hiking 1 3 6 5.67 8 10 2.69
gaming 1 2 4 4.02 6 14 2.66
clubbing 1 4 6 5.71 8 10 2.27
reading 1 7 8 7.63 9 10 2.15
tv 1 3 5 4.91 6 10 2.06
theater 1 5 7 7.48 9 10 2.19
movies 2 7 8 8.00 9 10 1.61
concerts 1 5 7 7.26 9 10 2.15
music 4 7 8 7.79 9 10 1.56
shopping 1 3 5 6.52 9 10 2.62
yoga 1 2 4 4.28 7 10 2.82

A simple visualization of the values is presented below using box plots. These visually indicate simple summary statistics of an independent variable (e.g. median, top and bottom quartiles, min, max). For example, for class 0

and class 1:
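Per-class summary tables like those above can be produced with a group-by describe; a Python sketch on toy data (column names mirror two of the real features, the values are made up):

```python
import pandas as pd

# Toy frame: dec_o is the 0/1 class, attr_o/fun_o are two of the ratings.
df = pd.DataFrame({
    "dec_o":  [1, 1, 1, 0, 0, 0],
    "attr_o": [8, 7, 9, 5, 6, 4],
    "fun_o":  [9, 7, 8, 5, 6, 5],
})

# One summary table per class, like the "selected" / "not selected" tables.
summary = df.groupby("dec_o")[["attr_o", "fun_o"]].describe()
print(summary.loc[1, ("attr_o", "mean")])  # mean attractiveness, class 1

# Box plots per class would be: df.boxplot(column="attr_o", by="dec_o")
```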

Step 4: Classification and Interpretation

For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and random forests.

Running a basic CART model with complexity control cp=0.01, leads to the following tree:

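The trees here come from R's rpart and its complexity parameter cp; a rough scikit-learn analogue is cost-complexity pruning via ccp_alpha. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; ccp_alpha plays a role comparable to rpart's cp
# (larger values prune the tree more aggressively).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# Text rendering of the fitted rules, like the tree diagrams shown above.
print(export_text(tree, max_depth=2))

# Probability of class 1 for the first few observations.
probs = tree.predict_proba(X[:5])[:, 1]
```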

The key decision criteria can be explained by the following table.

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV2 sinc_o
IV3 intel_o
IV5 amb_o
IV27 music
IV12 career_c
IV26 concerts
IV7 field_cd
IV25 movies
IV16 dining
IV22 reading
IV15 exercise
IV21 clubbing
IV24 theater

For example, this is how the tree would look if we set cp = 0.005:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV2 sinc_o
IV3 intel_o
IV5 amb_o
IV7 field_cd
IV16 dining
IV12 career_c
IV15 exercise
IV27 music
IV26 concerts
IV23 tv

Below we present the probability that our validation data belong to class 1. For the first few validation observations, using the first CART above, it is:

Actual Class Probability of Class 1
Obs 1 1 0.65
Obs 2 0 0.65
Obs 3 0 0.84
Obs 4 0 0.21
Obs 5 0 0.21

Logistic regression is a method similar to linear regression, except that the dependent variable can be discrete (e.g. 0 or 1). Linear logistic regression estimates the coefficients of a linear model using the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.7 1.0 -5.4 0.0
attr_o 0.6 0.1 10.5 0.0
sinc_o 0.0 0.1 0.3 0.8
intel_o 0.1 0.1 0.7 0.5
fun_o 0.2 0.1 4.2 0.0
amb_o -0.3 0.1 -4.3 0.0
shar_o 0.3 0.0 6.6 0.0
field_cd 0.0 0.0 -0.4 0.7
race 0.1 0.1 1.3 0.2
goal 0.0 0.1 -0.5 0.6
date 0.0 0.1 0.1 0.9
go_out -0.1 0.1 -0.7 0.5
career_c 0.0 0.0 -0.9 0.4
sports 0.0 0.0 0.0 1.0
tvsports 0.0 0.0 -0.9 0.4
exercise 0.1 0.0 1.7 0.1
dining -0.1 0.1 -1.4 0.2
museums 0.1 0.1 1.5 0.1
art -0.1 0.1 -1.3 0.2
hiking 0.0 0.0 0.1 0.9
gaming 0.0 0.0 -0.1 0.9
clubbing 0.0 0.0 -1.1 0.3
reading 0.0 0.0 0.0 1.0
tv 0.0 0.0 -0.6 0.6
theater 0.0 0.0 0.8 0.4
movies -0.1 0.1 -0.9 0.3
concerts 0.0 0.1 0.5 0.6
music 0.0 0.1 -0.6 0.5
shopping 0.1 0.0 2.4 0.0
yoga 0.0 0.0 -1.3 0.2

Random forests was the last method we used. Below is an overview of the key success factors in speed dating decision making.
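Both of these models have direct Python analogues; a sketch on synthetic stand-in data (the report itself used R), with permutation importance standing in for the random forest's mean decrease in accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data; the real features are the 29 rating variables.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)

# Logistic regression: coefficients are the analogue of the Estimate column.
logit = LogisticRegression(max_iter=1000).fit(X, y)
print(logit.coef_.round(1))

# Random forest with a permutation-style importance measure.
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=5, random_state=1)
print(imp.importances_mean.round(2))
```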

The table below shows the key drivers of the classification according to each of the methods used.

CART 1 CART 2 Logistic Regr. Random Forests - mean decrease in accuracy
attr_o 1.00 1.00 1.00 1.00
sinc_o 0.23 0.24 0.03 0.17
intel_o 0.20 0.21 0.07 0.18
fun_o 0.48 0.48 0.40 0.53
amb_o -0.05 -0.05 -0.41 0.15
shar_o 0.44 0.43 0.63 0.66
field_cd 0.00 -0.02 -0.04 0.14
race 0.00 0.00 0.12 0.13
goal 0.00 0.00 -0.05 0.20
date 0.00 0.01 0.01 0.17
go_out 0.00 0.00 -0.07 0.11
career_c 0.00 0.00 -0.09 0.13
sports 0.00 0.00 0.00 0.18
tvsports -0.03 -0.05 -0.10 0.18
exercise 0.00 0.00 0.16 0.18
dining -0.01 -0.01 -0.13 0.16
museums 0.01 0.03 0.14 0.17
art -0.01 -0.03 -0.12 0.19
hiking 0.00 0.01 0.01 0.18
gaming 0.00 -0.03 -0.01 0.17
clubbing -0.17 -0.17 -0.10 0.13
reading 0.00 0.00 0.00 0.18
tv 0.00 0.00 -0.06 0.14
theater 0.01 0.01 0.08 0.21
movies 0.00 -0.01 -0.09 0.14
concerts 0.00 0.01 0.05 0.15
music 0.00 -0.04 -0.06 0.20
shopping 0.00 0.01 0.23 0.22
yoga 0.00 -0.01 -0.12 0.13

In general we do not see very significant differences across the methods used, which makes sense.

Step 5: Validation accuracy

1. Hit ratio

Below is the percentage of observations that have been correctly classified (the predicted class, using a 50% probability threshold, is the same as the actual class) for the validation data:

Hit Ratio
First CART 65.43210
Second CART 66.66667
Logistic Regression 70.98765
Random Forests 68.51852

while for the estimation data the hit rates are:

Hit Ratio
First CART 76.17579
Second CART 78.33462
Logistic Regression 75.79029
Random Forests 99.38319

A simple benchmark to compare the performance of a classification model against is the Maximum Chance Criterion. This measures the proportion of the class with the largest size. For our validation data the largest group is people who were not selected by their partner: 102 out of 162 people. Clearly, without doing any discriminant analysis, if we classified all individuals into the largest group we would get a hit rate of 62.96% without doing any work.

In our case this particular criterion is met for all the methods that were used.
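The hit ratio and the Maximum Chance Criterion benchmark are both simple to compute; a Python sketch with toy vectors:

```python
import numpy as np

# Toy actual classes and predicted probabilities of class 1.
actual = np.array([1, 0, 0, 1, 0, 0])
prob_1 = np.array([0.7, 0.2, 0.6, 0.8, 0.4, 0.1])

# Hit ratio: share of observations whose thresholded prediction is correct.
predicted = (prob_1 > 0.5).astype(int)
hit_ratio = (predicted == actual).mean() * 100

# Maximum Chance Criterion: share of the largest class.
counts = np.bincount(actual, minlength=2)
mcc_benchmark = counts.max() / counts.sum() * 100

print(round(hit_ratio, 2), round(mcc_benchmark, 2))
```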

2. Confusion matrix

The confusion matrix shows for each class the number (or percentage) of the data that are correctly classified for that class. For example for the method above with the highest hit rate in the validation data (among logistic regression, 2 CART models and random forests), the confusion matrix for the validation data is:

Predicted 1 Predicted 0
Actual 1 58.33 41.67
Actual 0 21.57 78.43
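A row-normalised confusion matrix like the one above can be computed directly; a Python sketch with toy labels:

```python
import numpy as np

# Toy actual and predicted classes.
actual    = np.array([1, 1, 1, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 0, 0, 0, 0, 1, 0])

# Count matrix: rows are actual 0/1, columns are predicted 0/1.
cm = np.zeros((2, 2))
for a, p in zip(actual, predicted):
    cm[a, p] += 1

# Convert each row to percentages of that actual class.
cm_pct = cm / cm.sum(axis=1, keepdims=True) * 100
print(cm_pct.round(2))
```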

3. ROC curve

The ROC curves for the validation data for all four methods are below:

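An ROC curve plots the true-positive rate against the false-positive rate across all probability thresholds; a Python sketch with toy values:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy actual classes and predicted probabilities of class 1.
actual = [0, 0, 1, 1, 0, 1]
prob_1 = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# FPR/TPR pairs across thresholds, plus the area under the curve.
fpr, tpr, thresholds = roc_curve(actual, prob_1)
auc = roc_auc_score(actual, prob_1)
print(round(auc, 3))
# Plotting fpr against tpr gives curves like those for the four classifiers.
```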

4. Lift curve

The Lift curves for the validation data for our four classifiers are the following:

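A lift curve compares the share of class-1 cases captured in the top-ranked observations against the base rate; a Python sketch with toy values:

```python
import numpy as np

# Toy actual classes and predicted probabilities of class 1.
actual = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
prob_1 = np.array([0.9, 0.3, 0.8, 0.2, 0.4, 0.7, 0.1, 0.35, 0.15, 0.05])

order = np.argsort(-prob_1)             # rank by probability, highest first
top_half = actual[order][:5]            # top 50% of the ranked list
lift = top_half.mean() / actual.mean()  # captured rate vs. base rate

print(lift)
# Repeating this for each decile and plotting gives the lift curve.
```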

Step 6: Test Accuracy

Below are the hit ratios for all four methods based on the test dataset:

Hit Ratio
First CART 71.77914
Second CART 70.55215
Logistic Regression 76.07362
Random Forests 74.23313

The Confusion Matrix for the model with the best validation data hit ratio above:

Predicted 1 Predicted 0
Actual 1 74 26
Actual 0 22 78

ROC curves for the test data:

Lift Curves for the test data: